1. INTRODUCTION

Corporate social responsibility (CSR) has emerged as an essential tool in today’s dynamic business
arena for addressing environmental issues and achieving long-term company growth. Several academics have
proposed that CSR, as an emerging business strategy, may increase sustainability, long-term growth and
stability in complex, competitive working environments [1]-[4]. With the rise of numerous CSR concerns,
which have changed the natural, social, economic, and political aspects of organisations, CSR has become
a critical dimension for these entities to consider. According to Carroll [5], “corporate social
responsibility” is defined as “organisations’ economic, legal, ethical, and philanthropic expectations
that society has at a given time” [6], [7]. Philanthropic corporate social responsibility (PCSR) is widely
accepted as the core of CSR due to its strong connection to cause-related marketing, which plays a critical
role in influencing consumers’ attitudes toward the brand [8], [9]. As a result, many recent studies have
focused on PCSR [10]. PCSR is defined as “corporate actions that contribute to society's expectation of a
good corporate citizen” [11]. PCSR activities are essential for raising a company's profile and improving its
overall results [12].

Moreover, United Nations (UN) development organisations and countries are continuously
requesting that companies contribute more to sustainable development goals, such as poverty eradication. In
this regard, the 2030 United Nations Sustainable Development Plan stresses companies’ commitment to
tackling the issues of sustainable development, one of which is eradicating poverty, and promotes fruitful
public-private collaborations to achieve this goal [13], [14]. However, CSR efforts and sustainable
management practices have paid little attention to the societal issue of poverty reduction [15]. According
to Musili [16], the factors that influence poverty in Africa include societal and political greed, poor
governance structures, spatial inequality in the distribution of resources, unemployment, poor
infrastructure, and political instability. Reducing poverty requires strong institutions, non-marginalization
of communities during resource allocation, and transparency in government systems. However, this is not
easy to implement. The disparities can be seen in the distribution of resources by the government,
agencies, individuals, and NGOs when a country is dealing with a disaster. For example, in early 2022,
three states in Malaysia (Selangor, Pahang, and Melaka) were hit by extraordinary floods that destroyed
properties and claimed lives. This attracted the attention of many parties, who offered donations. However,
although various parties came forward to provide help and donations, the distribution of donations was
seen as inefficient and focused on certain types of support. Donations of clothes and food were dumped at
some temporary evacuation centres (Pusat Pemindahan Sementara, PPS) [17], [18]. Besides, according to the
media, some help did not reach the needy as expected due to the authorities’ failure to identify and
oversee programme execution [19]. This problem is not recent; it has persisted for a long time.

Furthermore, CSR-related activities are often ad hoc in nature, which makes it difficult to foresee
what further actions or activities to perform in future. The allocation of budgets to the needy is seen as
untargeted, resulting in aid going to those who are not eligible to receive it. This occurs due to
unstructured management and difficulties in categorizing the CSR activities that must be carried out [20].
Critical information is frequently obscured by redundancy and noise, and clustering the data according to
comparable characteristics is one method for quickly summarising the data for further study [21].
Therefore, identifying the categories and types of philanthropic activities is needed to manage CSR
activities effectively. Identifying activities will enable a corporation to properly plan its resources for
the betterment of society, and could also provide companies or other bodies with a context and reference
point for the analysis, planning, and evaluation of PCSR activities. The pattern identification process can
be automated through document clustering techniques. Document clustering is utilised in a wide variety of
domains, including information retrieval, social media analysis, neuroscience, image processing, text
analysis, and bioinformatics [21].

To reduce poverty in Kenyan rural areas, Musili [16] employed clustering algorithms to categorise
rural farming households as poor or non-poor. Rahman et al. [22] present a k-means clustering architecture
with cosine similarity for the B40 group (Malaysia's bottom 40% of households by income) to discover the
appropriate indicators and dimensions for data-driven multidimensional poverty index (MPI) measurement.
The clustering algorithm identified eight groups within the B40 group, which may aid the government in
appropriately identifying members of the B40 group who are financially burdened and may have been
misclassified before. Using the multinomial Naive Bayes technique from text mining, Das et al. [23]
examined CSR reports from the automobile industry. The study was able to identify the areas on which a CSR
report should focus. According to Chae and Park [24], “structural topic modeling (STM)”, a machine
learning-based content analysis tool, is used to understand themes or subjects from CSR-related Twitter
interactions. Taran and Mirkin [25] apply the popular k-means clustering approach to the conventional
Morgan Stanley Capital International (MSCI) database and focus on four primary dimensions of CSR:
environment, social/stakeholder, labor, and governance. Castellanos et al. [26] utilize text data mining
(TDM) to analyse the contents of CSR reports. This analysis considers five CSR dimensions that are based
on the Sustainability Accounting Standards Board (SASB) (2014).

Mishra et al. [27] applied the k-means algorithm for clustering to identify the segments of the
datasets. Sherkat et al. [28] proposed the k-means algorithm as a clustering method to identify the job
fields people are most interested in and to arrange training sessions accordingly [29]. In addition, the
authors of [30] used the latent Dirichlet allocation (LDA) approach to discover topics in a textual corpus
based on a semantic analysis of ‘big data software engineering (BDSE)’ job advertisements. Sutherland and
Kiatkawsin [31] used LDA to extract latent topics from a corpus of Chinese short text. Thus, they were able
to generate topic-focused and less noisy features. For topic modeling and classification, the authors of
[32] proposed the LDA and possibilistic fuzzy c-means (PFCM) methods. The proposed approach increased
accuracy in Twitter sentiment analysis by up to 3.5%. The study in [33] used LDA and the k-means clustering
algorithm on Arabic text documents, and the results reveal that the combined method performs substantially
better than k-means alone. However, the hybridization of topic modelling and document clustering has seen
only limited application in other domains, and none in CSR.

Therefore, this research attempts to enhance the conventional technique by proposing an integration
of document clustering and topic modeling for categorizing CSR activities. The purpose of this study is
not only to identify the hidden patterns of CSR activities, which can be used as a strategic approach in
planning; the findings could also assist the government in strategizing and increasing the participation
of companies and NGOs in CSR activities for poverty mitigation programs. The remainder of this paper first
explains document clustering and topic modeling and describes how both techniques are applied in this
study. It then presents the results of the analysis and, finally, discusses the implications for future
research.


2. METHOD 

The proposed method consists of four steps: dataset collection, text pre-processing, document
clustering and topic modelling. The experiment was implemented in RapidMiner, a prominent data mining
tool. A brief description of each step is presented in the following subsections.


2.1. Dataset collection 

This experiment used a dataset gathered from three years of annual reports published by the 25
companies that won CSR awards in Malaysia in 2020, available on the data stream [34]. An annual report is a
detailed description of a corporation's activities during the previous year. Firms release CSR reports in
order to share their initiatives and outcomes on social responsibility. The annual reports from 2017 to
2019 were collected for each company, giving 75 documents in total. Unstructured textual data on CSR
activities were then extracted from each annual report and converted into a structured format. The
documents were then collated and summarized for the cleaning process. The 25 CSR-award winning companies
are listed in Table 1.


Table 1. List of CSR-award winning companies

No.  Company                        No.  Company                              No.  Company
1    Tenaga Nasional Berhad (TNB)   10   Kumpulan Perangsang Selangor (KPS)   19   Kenanga Investment Bank
2    KPMG                           11   Exim Bank                            20   Kuala Lumpur Kepong Berhad (KLK)
3    Nippon Paint                   12   Hong Leong Bank                      21   Starbucks
4    Bank Rakyat                    13   Kwantas Corporation Berhad           22   Panasonic
5    Pharmaniaga                    14   Sunway Berhad                        23   Poh Kong
6    RHB Bank                       15   Titijaya Land Berhad                 24   LeapEd Services Sdn. Bhd.
7    7-Eleven                       16   Digi Telecommunications              25   YTL Corporation Berhad
8    Great Eastern                  17   BH Petrol
9    Top Gloves                     18   FedEx


2.2. Text pre-processing 
Before the dataset can be effectively analysed and mined, the corpus documents must be noise-free. The
phrase “corpus pre-processing” refers to the practice of deleting duplicate and less informative phrases
from a corpus in order to establish a clean corpus. Data cleaning was performed by removing typographical
errors and by validating and correcting values against a known list of entities. The pre-processing
involves several steps that prepare the text data for qualitative analysis, as follows:

i) Normalization: the data should be normalized or standardized to bring all of the variables into
proportion with one another. Non-numeric qualitative data should be converted to numeric quantitative
data.

ii) Tokenization: this stage divides the given text into smaller sections called sentences, and the
sentences into smaller portions called tokens. White space or line breaks separate tokens.

iii) Stop words removal: stop words are words that are not relevant to the desired analysis. Stop words
such as “a”, “and”, “are” or “do” are eliminated because they occur frequently in the documents and do not
provide any useful information.

iv) Stemming: stemming is the process of reducing words to their root form. The aim of stemming is to
limit the diversity in text data by transforming words into their common form. For example,
“contributes”, “contributing” and “contributed” were converted to “contribute”.

v) Building corpus: each document is represented in the corpus by a sequence of pairs. The first element
of each pair is the numeric ID of a word, while the second element expresses how frequently that word
occurs. For instance, in [(1,1), ...], the first “1” refers to the word “Donation” (for example) and the
second “1” refers to the number of times the word occurs in the text.
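
The tokenization, stop-word removal, stemming and corpus-building steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the RapidMiner process used in the study: the stop-word list is a tiny hypothetical sample, and `stem` is a toy suffix stripper standing in for a real stemmer, so it yields truncated stems such as "contribut" rather than dictionary words.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; a real list is much longer).
STOP_WORDS = {"a", "an", "and", "are", "do", "the", "to", "of", "in"}


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def stem(token):
    """Toy suffix stripper (hypothetical stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s", "e"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def build_corpus(documents):
    """Return (vocabulary, corpus); each document becomes (word_id, count) pairs."""
    vocab = {}
    corpus = []
    for text in documents:
        tokens = [stem(t) for t in tokenize(text) if t not in STOP_WORDS]
        counts = Counter(tokens)
        for tok in counts:
            vocab.setdefault(tok, len(vocab))  # assign the next free ID
        corpus.append(sorted((vocab[t], c) for t, c in counts.items()))
    return vocab, corpus


vocab, corpus = build_corpus(
    ["The company contributes donations and contributed a donation."]
)
print(vocab)   # {'company': 0, 'contribut': 1, 'donation': 2}
print(corpus)  # [[(0, 1), (1, 2), (2, 2)]]
```

In the printed corpus, the pair (2, 2) says that word ID 2 ("donation") occurs twice in the document, which is the same pair representation described in step v).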


2.3. Document clustering 

Document clustering involves grouping together documents that are similar to each other.
Clustering is an unsupervised machine learning approach that can be performed on unlabeled data. In this
study, the k-means algorithm is used to establish a set of k clusters and assign each document to a cluster.
According to Zainal et al. [35], MacQueen first introduced the k-means clustering technique in 1967 as a
simple unsupervised clustering approach. K-means is a partitioning clustering technique that is used to
automatically cluster texts in a corpus so that documents in one cluster are more similar to each other
than to documents in other clusters. The similarity between examples is calculated using the following
distance measure settings:
- Measure type: numerical measures
- Numerical measure: cosine similarity

Cosine similarity is often used in text analysis to measure document similarity [36]. It is measured
by the cosine of the angle between two vectors, which determines whether the two vectors are pointing in
roughly the same direction. Starting from k initial centroids, all documents are assigned to their nearest
cluster (nearest is defined by the measure type). Next, the centroids (determined by the position of the
center) of the clusters are recalculated by averaging over all documents of one cluster. The previous
steps are repeated for the new centroids until the centroids no longer move or the maximum number of
optimization steps is reached. Figure 1 shows the operators involved in RapidMiner Studio to perform the
document clustering process.
Figure 1. Document clustering process in RapidMiner
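
The assign-and-recompute loop described above can be illustrated with a small pure-Python sketch of k-means using cosine similarity. It is not the RapidMiner operator itself; the function names, the random initialisation from the data points, and the default of 100 optimization steps are assumptions made for illustration.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def kmeans_cosine(vectors, k, max_steps=100, seed=0):
    """Cluster vectors into k groups, treating highest cosine similarity as nearest."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]  # assumed initialisation
    assignments = None
    for _ in range(max_steps):
        # Assignment step: each document goes to its most similar centroid.
        new_assignments = [
            max(range(k), key=lambda c: cosine_similarity(v, centroids[c]))
            for v in vectors
        ]
        if new_assignments == assignments:
            break  # assignments are stable, so the centroids no longer move
        assignments = new_assignments
        # Update step: each centroid becomes the average of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
assignments, centroids = kmeans_cosine(vectors, k=2)
print(assignments)
```

On this toy input the first two vectors point in nearly the same direction and end up in one cluster, while the last two end up in the other; which cluster receives label 0 depends on the random initialisation.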


According to Cheng et al. [37], document clustering partitions a heterogeneous dataset into several
clusters with comparable members. However, the clusters created come without a description or
characterization. Therefore, this study uses a topic modeling approach to characterize them.


2.4. Topic modeling 

Topic modeling seeks to locate a set of topics in a group of text documents; each topic is defined as
a distribution across a set of words. This is performed by the application of statistical modelling
techniques [33]. One of the most popular methods is LDA. LDA extends probabilistic latent semantic analysis
(PLSA) by employing Dirichlet priors for document-specific topic mixtures, which allows it to generalise to
previously unseen documents. LDA has been a tremendous success in text mining due to its excellent
generalisation and extensibility [30]. LDA is based on the idea that each text document is composed of a
number of topics, each of which contains many words. The input required by LDA is merely the text
documents and the expected number of topics. In this study, the model is fitted using Gibbs sampling.

The LDA approach starts by randomly assigning a topic to each word in each document, which involves
75 documents in this study. Two frequencies can then be computed from the word distribution over all
topics, namely topic-document and word-topic. The popularity of each topic in a document is measured by the
frequency counts calculated during initialization (topic frequency) and a Dirichlet-generated multinomial
distribution over topics for each document. Meanwhile, the popularity of a word in each topic is measured
by the frequency counts calculated during initialization (word frequency) and a Dirichlet-generated
multinomial distribution over words for each topic. Each word is then reassigned to the topic with the
largest conditional probability. Lastly, the process is iterated until the word-to-topic assignments reach
a stable state. In RapidMiner, the Extract Topics from Data (LDA) operator is used to execute the process.
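
The iterative reassignment described above can be sketched as a small collapsed Gibbs sampler. This is an illustrative sketch, not the RapidMiner operator: the hyperparameters `alpha` and `beta`, the iteration count and the toy documents are assumptions, and the sketch draws each new topic at random in proportion to its conditional weight (the usual Gibbs sampling step) rather than always taking the single most probable topic.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, iterations=200,
              alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on documents given as lists of word IDs."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # topic-document counts
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # word-topic counts
    topic_total = [0] * num_topics
    assignments = []

    # Initialisation: give every word occurrence a random topic.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_d.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Remove the current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Conditional weight of each topic for this word occurrence.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


# Toy corpus: word IDs 0-1 and 3-4 form two themes, word 2 is shared.
docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 2], [0, 1, 1, 0], [4, 3, 3, 4]]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2, vocab_size=5)
print(doc_topic)
print(topic_word)
```

The two returned count tables correspond to the topic-document and word-topic frequencies mentioned in the text; dividing each row by its sum (after adding the smoothing constants) gives the estimated distributions.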


3. RESULTS AND DISCUSSION 
3.1. Text pre-processing 

In this phase, the dataset was cleaned using the text pre-processing steps, which produced three
numerical values for each document, namely document length, token number and token length. These values
are derived from the operators extract length, extract token number, and aggregate token length, and are
later used in the clustering process. The input and output of the text cleaning process are shown in
Figure 2. From this process, the word list and word vectors for PCSR activities can be derived.
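
As a rough illustration of the three numerical attributes, the sketch below computes them for a single string. The exact definitions used by the RapidMiner operators are not stated in the text, so these are assumptions: document length is taken as the character count, token number as the count of whitespace-separated tokens, and token length as their ratio (the values reported later in Figure 3 are consistent with this ratio, for example 17407 / 1761 ≈ 9.885).

```python
def document_features(text):
    """Return the three numeric attributes for one document (assumed definitions)."""
    tokens = text.split()                 # whitespace tokenization
    document_length = len(text)           # number of characters
    token_number = len(tokens)            # number of tokens
    # Ratio of the two values above; 0.0 for an empty document.
    token_length = document_length / token_number if token_number else 0.0
    return {
        "document_length": document_length,
        "token_number": token_number,
        "token_length": token_length,
    }


features = document_features("donation for schools")
print(features)
```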


(a)
Row No.  COMPANY       YEAR  TEXT
1        TNB           2019  Universiti Tenaga Nasional, We have been contributing towards the ...
2        TNB           2018  NET PROFIT ATTRIBUTABLE TO OWNERS OF THE COMPANY RM3,723.70 million. Joint ...
3        TNB           2017  NET PROFIT ATTRIBUTABLE TO OWNERS OF THE COMPANY RM6,904.0 million. Total ...
4        KPMG          2019  Legal entity for social enterprises In July 2020, the Dutch government announced a ne...
5        KPMG          2018  KPMG has committed itself worldwide to contributing to the 17 Sustainable Developme...
6        KPMG          2017  Jan Hommen Scholarship (founded in 2016) 8 students were awarded a EUR 2,500 sc...
7        NIPPON PAINT  2019  In recent years, the three perspectives of Environment (E), Society (S), and Governance...
8        NIPPON PAINT  2018  Net income (100 million yen) 371. Contributions to local communities through table te...
9        NIPPON PAINT  2017  Net Income (billions of Yen) 36.01. In November 2016, we held a Global Quality Confe...
10       BANK RAKYAT   2019  Total Assets RM109.62 billion. To effect positive changes in society by providing Islam...
11       BANK RAKYAT   2018  Total Assets RM106.89 billion. Profit after Taxation and Zakat (RM1.76 Billion). Contri...
12       BANK RAKYAT   2017  Total Assets RM105.45 billion. Benefiting needy communities through zakat contributi...

(b)
Row No.  COMPANY       YEAR  document_length  token_number
1        TNB           2019  5514             881
2        TNB           2018  3120             503
3        TNB           2017  4190             710
4        KPMG          2019  1338             235
5        KPMG          2018  676              112
6        KPMG          2017  3269             557
7        NIPPON PAINT  2019  6450             1099
8        NIPPON PAINT  2018  4272             747
9        NIPPON PAINT  2017  3454             585
10       BANK RAKYAT   2019  11056            1857
11       BANK RAKYAT   2018  3944             654
12       BANK RAKYAT   2017  2712             453

Figure 2. Data cleaning of (a) input process and (b) output process


3.2. K-means clustering 

After text pre-processing, the word vector created had 7564 entries. For clustering purposes, the
attributes selected were those with numerical values, namely document_length, token_length and
token_number. This is because k-means requires numeric attributes in order to compute the distance
measure. The data were then normalised to bring the variables onto a comparable scale and so avoid
difficulties in the grouping or clustering process (refer to Figure 3).
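
The text does not name the normalisation method. The sketch below assumes a z-score standardisation (subtract the mean, divide by the standard deviation), which would produce values centred on zero with both negative and positive entries; the use of the population standard deviation is a further assumption.

```python
import statistics


def z_score(values):
    """Standardise a list of numbers to zero mean and unit standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumption)
    if std == 0:
        return [0.0 for _ in values]  # constant attribute: nothing to scale
    return [(v - mean) / std for v in values]


# Hypothetical document lengths for three documents.
normalised = z_score([1000, 2000, 3000])
print(normalised)
```

After this transformation, attributes measured on very different scales (thousands of characters versus single-digit token lengths) contribute comparably to the distance measure.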


                            (a) before normalization                       (b) after normalization
Row No.  text              document_length  token_number  token_length   document_length  token_number  token_length
1        univers univ...   17407            1761          9.885          0.527            0.437         2.139
2        net net prof...   9854             1005          9.805          -0.169           -0.215        1.852
3        net net prof...   13271            1419          9.352          0.146            0.142         0.225
4        legal legal_e...  4239             469           9.038          -0.687           -0.678        -0.903
5        kpmg kpmg...      2131             223           9.556          -0.882           -0.890        0.957
6        jan jan_hom...    10357            1113          9.305          -0.123           -0.122        0.057
7        recent recen...   20436            2197          9.302          0.807            0.813         0.043
8        net net_inco...   13555            1493          9.079          0.172            0.206         -0.757
9        net net_inco...   10937            1169          9.356          -0.069           -0.074        0.238
10       total total_a...  35015            3713          9.430          2.151            2.122         0.506
11       total total_a...  12474            1307          9.544          0.072            0.045         0.914
12       total total_a...  8579             905           9.480          -0.287           -0.302        0.682
13       corpor corp...    15685            1635          9.593          0.368            0.328         1.091


Figure 3. The dataset condition, (a) before normalization process and (b) after normalization process


Based on the normalised data in Figure 3, the study needs to determine the number of clusters, K.
The Optimize Parameters operator, which executes a subprocess of clustering and performance operators, was
used to find the optimal value of K. The k-means clustering technique minimizes the distance within
clusters while maximizing the distance between clusters. The result is displayed in Figure 4. With
high-dimensional data, it can be hard to know the “best” number of clusters. Therefore, this study adopted
the “elbow method” of cluster selection, under which K=7 was selected.
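
The elbow is usually read off the plot by eye. One simple way to automate that reading, shown here as an assumption rather than the procedure used in the study, is to draw a straight line between the first and last points of the curve and pick the K where the curve falls furthest below that line. The scores in the example are made up for illustration and are not the values behind Figure 4.

```python
def elbow_point(ks, scores):
    """Return the k where the curve lies furthest below the chord joining its ends."""
    k_first, k_last = ks[0], ks[-1]
    s_first, s_last = scores[0], scores[-1]
    slope = (s_last - s_first) / (k_last - k_first)
    # Vertical gap between the chord and the curve at each k.
    gaps = [(s_first + slope * (k - k_first)) - s for k, s in zip(ks, scores)]
    return ks[gaps.index(max(gaps))]


# Hypothetical average within-centroid distances for k = 1..6.
ks = [1, 2, 3, 4, 5, 6]
scores = [10.0, 6.0, 3.0, 2.5, 2.2, 2.0]
print(elbow_point(ks, scores))  # 3
```

In this toy curve the distance drops sharply up to k = 3 and flattens afterwards, so k = 3 is returned.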


[Figure 4 plot: average within-centroid distance (y-axis) against the number of clusters, Clustering.k, from 2 to 20 (x-axis)]


Figure 4. Scatter plot of k-means cluster


However, it was reported in [32] that existing clustering models lack a mechanism for identifying
local topics and global topics. Clustering text data is a difficult process, and obtaining accurate results
from it is equally difficult. According to Lu and Castka [38], the unlabeled clusters that result from the
clustering process can be characterized and explained using the topic modeling methodology. Therefore,
integrating topic modeling might offer a global picture of the themes and their differences, while
simultaneously enabling a comprehensive analysis of the keywords most closely linked to each topic.


3.3. LDA approach of topic modeling 
In Malaysia, the most popular forms of CSR activity are currently donations, grants, education support
and sponsorships [39]. However, more recent categorisations are hard to find, which further justifies the
need for this study. The categories are used to further clarify the topics discovered through the LDA
approach, particularly the Gibbs sampling technique. The same value of K selected in document clustering is
also used as the number of topics in this part. The probabilities of each word in a “topic” are sorted in
descending order. On close inspection, the results generated by LDA provide much more concrete and specific
meaning. Each document is usually assumed to be generated by a few of the total number of possible topics,
so every word in every document is assumed to be attributable to one of the document’s topics.

Table 2 shows that topic modelling successfully found a set of topics in the group of text
documents when applied to the entire dataset. LDA takes a term×document matrix as input and outputs the
topic-word distribution using the Gibbs sampling approach. The topic-word matrix contains the words that
each topic can contain. Given the most common words in each topic, it is not hard to manually name each
topic. Thus, from the five (5) top words in each topic, we can deduce how the topics relate to the field of
PCSR. The top five most frequent words displayed in Table 2 reflect the related concepts of each “topic”:
Topic 0 is about community health and safety, Topic 1 is about education, Topic 2 is about employee
welfare, Topic 3 is about social event, Topic 4 is about children development, Topic 5 is about local
business development, and Topic 6 is about financial support.
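
Reading the top words off a topic-word matrix amounts to sorting each topic's row in descending order and keeping the first few entries. A minimal sketch follows; the function name, the toy counts and the three-word vocabulary are hypothetical and are not taken from the study's data.

```python
def top_words(topic_word_counts, id_to_word, n=5):
    """For each topic, return its n most probable words with their probabilities."""
    result = []
    for counts in topic_word_counts:
        total = sum(counts) or 1  # avoid division by zero for an empty topic
        ranked = sorted(range(len(counts)), key=lambda w: counts[w], reverse=True)
        result.append([(id_to_word[w], counts[w] / total) for w in ranked[:n]])
    return result


# Toy topic-word counts for two topics over a three-word vocabulary.
id_to_word = ["health", "bank", "school"]
topic_word_counts = [[5, 1, 0], [0, 2, 6]]
tops = top_words(topic_word_counts, id_to_word, n=2)
print(tops)
```

The naming of each topic is still a manual step: the analyst inspects the listed words and chooses a label that summarises them.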


Table 2. Top 5 most frequent words for each topic identified

Topic    Label                         Top 5 words
Topic 0  Community health and safety   Health, Community, Safety, Provide, Environment
Topic 1  Education                     Educational, Foundation, Students, Schools, YTL
Topic 2  Employee welfare              Employees, Management, Work, Program, Participated
Topic 3  Social event                  Programme, Total, Organised, Students, Malaysia
Topic 4  Children development          Group, Children, Selangor, Staff, Awareness
Topic 5  Local business development    Million, Support, Business, Development, Local
Topic 6  Financial support             Bank, Sunway, Malaysia, Year, Part


Based on the findings, the types of PCSR activities are mapped to each CSR company as shown in
Figure 5, which represents the topic distribution of documents (companies). The findings further explain
the types of PCSR activities carried out by each company for the years 2017-2019. For example, TNB was
involved in PCSR activities specifically on social events and local business development over the 3-year
period. Meanwhile, Nippon Paint was mostly involved in education, children development and local business
development activities over the same period. This insight gives companies the option either to continue
with their current volunteering work or to venture into different kinds of PCSR activities. On the other
hand, the government can use this information to work together with companies that are interested in
specific PCSR activities to mitigate poverty.
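
A company-by-topic view of this kind can be derived from the document-topic counts by pooling each company's yearly documents and marking the topics that account for a large enough share. The text does not state how the marks were chosen, so the threshold of 0.3 and the pooling rule below are assumptions, and the companies and counts are toy values.

```python
def company_topics(doc_companies, doc_topic_counts, threshold=0.3):
    """Map each company to the topics holding at least `threshold` of its words."""
    num_topics = len(doc_topic_counts[0])
    pooled = {}
    # Pool the topic counts of all documents belonging to the same company.
    for company, counts in zip(doc_companies, doc_topic_counts):
        acc = pooled.setdefault(company, [0] * num_topics)
        for topic, count in enumerate(counts):
            acc[topic] += count
    marks = {}
    for company, acc in pooled.items():
        total = sum(acc) or 1
        marks[company] = [t for t, c in enumerate(acc) if c / total >= threshold]
    return marks


# Toy example: company A has two yearly documents, company B has one.
doc_companies = ["A", "A", "B"]
doc_topic_counts = [[8, 2, 0], [6, 2, 2], [0, 1, 9]]
marks = company_topics(doc_companies, doc_topic_counts)
print(marks)  # {'A': [0], 'B': [2]}
```

A lower threshold marks more topics per company, so the choice of threshold directly affects how focused or diversified each company's activities appear.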


[Figure 5 matrix: the 25 companies (rows) against PCSR activities Topic 0 to Topic 6 (columns), with an "x" marking each topic associated with a company]

Figure 5. Categories of PCSR activities by each company for the year 2017-2019


4. CONCLUSION

This study is a preliminary investigation into the categorization of PCSR activities based on the
annual reports of selected CSR companies in Malaysia, using an integrated approach of document clustering
and topic modeling. The two approaches are complementary, and combining them can minimize the limitations
of each technique. The findings categorize these activities into seven groups, namely Community Health and
Safety, Education, Employee Welfare, Social Event, Children Development, Local Business Development, and
Financial Support. Clustering PCSR activities in this way could assist companies and NGOs in managing the
distribution of funds effectively, and the identified clusters can be used in the government's intervention
plans to address poverty. In line with government programs, it is important to accurately predict
allocations based on the determined clusters, because such knowledge may guide companies and practitioners
in allocating resources to the areas and populations where they are most needed. The findings from this
study could support a transparent distribution mechanism for corporate funds to benefit the community and
the country, which can make a positive and significant impact on economic development at both the micro
and macro levels. Moreover, the proposed hybrid model not only identifies the hidden patterns of CSR
activities that can be used as a strategic approach in planning; its findings could also assist the
government in strategizing and increasing the participation of companies and NGOs in CSR activities for
poverty mitigation programs.

There are multiple areas that can be further explored in the future. In the current implementation,
k-means was used as the document clustering method; in future, fuzzy clustering could be used instead,
since the data obtained may be inconsistent or missing. Also, because cluster boundaries overlap, some
patterns may belong to more than one cluster group. K-means may not be successful in locating overlapping
clusters, and it also fails to handle issues such as incomplete external knowledge. Therefore, fuzzy
clustering could be used to overcome this drawback and produce better output.